Introduction to Open Data Science - Course Project

About the project


# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Thu Nov 19 21:48:27 2020"

Still a bit confused, I got back from fieldwork in Central Finland on Thursday and took the rest of the week off. But excited to start the course! Link to repo: https://github.com/vilnat/IODS-project


Week 2 - Regression and model validation


date()
## [1] "Thu Nov 19 21:48:27 2020"

Dataset

d14 <- read.table("data/learning2014.csv")
str(d14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(d14)
## [1] 166   7

This dataset contains the results of a survey study conducted in an introductory course on social statistics at the University of Helsinki. The study focused on the relationship between learning approaches and students' achievements. The dataset contains background information on each student, such as gender, age and how well they did in the course exam. It also contains aggregated survey results in the columns "deep", "stra" and "surf", which correspond to the deep, strategic and surface learning approaches.

This plot shows a summary of the variables in the dataset. The density curves on the diagonal show each variable's distribution: apart from the background variables, most of the data is fairly normally distributed, although the aggregated deep learning variable is somewhat skewed towards high values. The scatter plots and numerical values around the diagonal show the correlations between variables. The strongest correlations are between attitude towards statistics and exam points (positive) and between deep and surface learning (negative).

library(ggplot2)
library(GGally)

p <- ggpairs(d14, mapping = aes(alpha = 0.5),
             lower = list(combo = wrap("facethist", bins = 20, colour = "cornflowerblue")),
             diag = list(continuous = wrap("densityDiag", colour = "cornflowerblue")))
p

Finally, here is a summary of the variables:

summary(d14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Linear model

Here I built a linear model that explains exam success with the variables attitude, strategic learning (stra) and age. I chose these variables using the backward elimination method: start with all explanatory variables in the model, then remove the least significant one at a time until only statistically significant variables remain.
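The same procedure can be sketched with R's built-in step() function, which automates backward elimination. Note that step() drops terms by AIC rather than by raw p-values, so it is a close cousin of the manual procedure rather than the exact same thing; the built-in mtcars data is used here so the chunk runs on its own.

```r
# Backward elimination sketch: start from the full model and let step()
# remove the least useful term one at a time (AIC-based, not p-value-based).
full <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # only the surviving explanatory variables remain
```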

The summary below shows the statistical properties of the model. The residuals are reasonably normally distributed. Of the explanatory variables, attitude is clearly the most statistically significant, and it has a strong positive influence on exam points. While one could debate at length whether p-values around 0.05-0.1 are 'good enough', I chose to keep strategic learning and age in the model because they improve the R-squared value. A higher strategic learning score is associated with higher exam points, whereas age has a slightly negative association. Overall the model explains approximately 22 % of the variation in exam points (multiple R-squared = 0.2182), so a large part of the variation remains unexplained.

m <- lm(points ~ attitude + stra + age, data = d14)
summary(m)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = d14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

Linear models make certain assumptions about the data and about the relationship between the model and the dependent variable. Model residuals are the differences between the actual observations and the model's fitted values. Linear models assume that these residuals are normally distributed and that their spread does not depend on the fitted values. The top-left figure below (Residuals vs Fitted) shows the spread of the residuals across the fitted values. There doesn't seem to be any clear pattern, although a few residuals sit somewhat apart from the main cloud; even these, however, show no dependence on the fitted values.

The top-right figure (Normal Q-Q) shows how closely the residuals follow a normal distribution. For the most part the residuals lie close to the reference line, which means they are approximately normally distributed, apart from the largest negative residuals visible in the bottom left of that figure.

The final figure (Residuals vs Leverage) shows the influence of each individual observation on the model results. None of the observations has an unreasonably high influence on the model, meaning there are no clear outliers that could be removed to improve it.

par(mfrow = c(2,2))
plot(m, which = c(1,2,5))


Week 3 - Logistic regression

date()
## [1] "Thu Nov 19 21:48:35 2020"

This dataset contains information about students in two Portuguese schools. The data includes various attributes of the students, such as their family background and social activities, as well as grades from two subjects, math and Portuguese. The original data is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

data <- read.table("data/alcohol_is_bad.csv", sep = ";", header = TRUE)
colnames(data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data also contains information about the students' alcohol consumption, which gives us the opportunity to study which characteristics might correlate with it. For this, I chose four interesting variables (higher, activities, famrel and goout) that might or might not explain alcohol consumption.

Now, let's see how they actually compare to alcohol usage.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
g1 <- ggplot(data = data, aes(x=higher, y = alc_use))
g1 + geom_boxplot()

data %>% group_by(higher, high_use) %>% summarise(count = n())
## # A tibble: 4 x 3
## # Groups:   higher [2]
##   higher high_use count
##   <fct>  <lgl>    <int>
## 1 no     FALSE        9
## 2 no     TRUE         9
## 3 yes    FALSE      259
## 4 yes    TRUE       105

From the boxplot it would seem that lower alcohol consumption correlates with wanting higher education. However, as only 18 of the 382 students don't want higher education, no firm conclusions can be drawn, and thus my hypothesis remains unproven.
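To make the comparison concrete, the counts printed above can be turned into within-group shares. This is a small sketch that re-enters the printed counts by hand rather than recomputing them from the data.

```r
# Share of high alcohol use within each group, from the counts printed above
# (rows = higher, columns = high_use).
counts <- matrix(c(9, 9, 259, 105), nrow = 2, byrow = TRUE,
                 dimnames = list(higher = c("no", "yes"),
                                 high_use = c("FALSE", "TRUE")))
round(prop.table(counts, margin = 1), 2)  # row-wise proportions
```

Half of the "no" group are high users versus roughly 29 % of the "yes" group, but the "no" group is tiny.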

g1 <- ggplot(data = data, aes(x=activities, y = alc_use))
g1 + geom_boxplot()

data %>% group_by(activities, high_use) %>% summarise(count = n())
## # A tibble: 4 x 3
## # Groups:   activities [2]
##   activities high_use count
##   <fct>      <lgl>    <int>
## 1 no         FALSE      122
## 2 no         TRUE        59
## 3 yes        FALSE      146
## 4 yes        TRUE        55

Seems like there is no visible relationship between alcohol usage and extra-curricular activities. This is not that surprising and my original hypothesis was rather weak.

data$famrel <- as.character(data$famrel) #famrel is an integer and ggplot treats integers as continuous; converting to character makes it discrete so each level gets its own box
g1 <- ggplot(data = data, aes(x=famrel, y = alc_use))
g1 + geom_boxplot()

data %>% group_by(famrel, high_use) %>% summarise(count = n())
## # A tibble: 10 x 3
## # Groups:   famrel [5]
##    famrel high_use count
##    <chr>  <lgl>    <int>
##  1 1      FALSE        6
##  2 1      TRUE         2
##  3 2      FALSE       10
##  4 2      TRUE         9
##  5 3      FALSE       39
##  6 3      TRUE        25
##  7 4      FALSE      135
##  8 4      TRUE        54
##  9 5      FALSE       78
## 10 5      TRUE        24

There seems to be a slight relationship between particularly good family relationships and lower alcohol usage. The less clear relationship at the lower end might be caused by the smaller number of students there. This is in line with my original assumptions, yay me.

data$goout <- as.character(data$goout) #goout is an integer and ggplot treats integers as continuous; converting to character makes it discrete so each level gets its own box
g1 <- ggplot(data = data, aes(x=goout, y = alc_use))
g1 + geom_boxplot()

At least there is a clear positive correlation between alcohol usage and going out with friends. I managed to guess something right.

And now it's midnight and I'm falling asleep. So thanks for grading this far, feel free to move on to the next person.


Week 4 - Clustering and classification

2 - Data

The dataset we're handling this week contains data about Boston's suburbs, their housing values and various related variables. Here are some examples of the variables included:

* population-related variables, such as student-to-teacher ratio, crime rate and lower status of the population
* housing-related variables, such as number of rooms, value of owner-occupied homes and proportion of owner-occupied units built prior to 1940
* infrastructure-related variables, such as proportion of non-retail business acres per town and distance to Boston's employment centres
* an environmental variable: nitrogen oxides concentration

The dataset contains 14 numerical variables for each of the 506 suburbs of Boston, as shown below.

#Let's read the data
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

#And examine it a bit
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

3 - Graphical overview of data

Let's examine our dataset a bit more closely. From the histograms below, we can see that the variables are distributed rather differently. Some, such as crime rate (crim) and the proportion of black residents (black), are heavily skewed towards one end of the spectrum; accessibility to radial highways (rad) and property tax rate (tax) are concentrated at both the high and low ends; and the rest are roughly normally or evenly distributed.

library(ggplot2)
library(GGally)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
hist.data.frame(Boston, nclass =10)

And below we see how strongly these variables correlate with each other. The strongest negative correlations are between the proportion of lower-status population (lstat) and the value of owner-occupied homes (medv), between the proportion of old housing (age) and the distance to employment centres (dis), and between dis and the concentration of nitrogen oxides (nox). The strongest positive correlation is between accessibility to radial highways (rad) and the property tax rate (tax).

library(tidyr)
library(corrplot)
## corrplot 0.84 loaded
cor_matrix <- round(cor(Boston), 2)
corrplot(cor_matrix, method="circle", type="upper", tl.pos="d", tl.cex = 0.6)

4 - Data preparation

Next we will scale the data by subtracting each column's mean from its values and dividing the result by the column's standard deviation. As can be seen below, this sets each variable's mean to zero.

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
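As a quick sanity check, the standardization can be reproduced by hand for a single column; the manual result should match what scale() produced.

```r
library(MASS)  # for the Boston data
# Standardize one column manually: subtract the mean, divide by the sd.
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, as.numeric(scale(Boston$crim)))  # should be TRUE
```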

We'll also want the crime rate variable to be categorical, which is easily achieved by dividing it into classes at its quantiles.

bins <- quantile(boston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime)
## 
##      low  med_low med_high     high 
##      127      126      126      127

Finally, we want to be able to use the data for testing predictions so we'll divide the data into training and testing sets.

n <- nrow(boston_scaled)
#We'll divide the data into 80 % training and 20 % testing sets.
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

5 - Linear discriminant analysis

Next we'll use linear discriminant analysis (LDA), a classification method, to investigate crime rate in more detail.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The figure above shows a point cloud of the observations in the training data on a plane consisting of the first two LDA-axes. The points are coloured by crime rate.

6 - Prediction

Now we'll test how well our method can predict crime rates based on the testing data. We'll first remove the actual crime rate data from the test data, then predict them and finally compare the model results to the actual data.

#Move the correct crime rates away from the test data
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

#Predict
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       7        1    0
##   med_low    5      20        4    0
##   med_high   0       4       17    3
##   high       0       0        0   26

The model predicts high crime rates surprisingly well: all 26 high crime rate suburbs in the test set are classified correctly. The other classes fare somewhat worse, with roughly 65-70 % of the suburbs predicted correctly (15/23 for low, 20/29 for med_low and 17/24 for med_high), and nearly all of the misclassifications land in a neighbouring class.
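The printed cross-tabulation can be summarized numerically; here is a sketch that re-enters the table by hand (with a random train/test split, the exact counts change every time the document is knitted).

```r
# Rebuild the printed confusion matrix (rows = correct, columns = predicted).
classes <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(15,  7,  1,  0,
                  5, 20,  4,  0,
                  0,  4, 17,  3,
                  0,  0,  0, 26),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes, predicted = classes))
round(diag(conf) / rowSums(conf), 2)  # per-class share predicted correctly
sum(diag(conf)) / sum(conf)           # overall accuracy, about 0.76
```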

7 - K-means

Finally, we'll run a clustering analysis on the whole dataset using k-means. First we'll check the distances between observations, and then run the analysis with different numbers of clusters to find the optimal number.

data(Boston)
new_boston <- scale(Boston)
new_boston <- as.data.frame(new_boston)
dist_eu <- dist(new_boston)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(new_boston, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

Based on the plot above, two clusters looks like a good choice: the total within-cluster sum of squares drops sharply up to that point and decreases only gradually afterwards.
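The elbow can also be read numerically from how much each additional cluster reduces the total WCSS. This is a sketch; kmeans() starts from random centers, so the exact values vary unless a seed is set (the original chunk above sets no seed).

```r
library(MASS)  # for the Boston data
set.seed(123)  # fix the random starting centers for repeatability
boston_std <- as.data.frame(scale(Boston))
twcss <- sapply(1:10, function(k) kmeans(boston_std, k)$tot.withinss)
round(-diff(twcss))  # drop in total WCSS for each added cluster
```

The first drop (from one cluster to two) should dominate, which is the elbow seen in the plot.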

Now we'll run the k-means algorithm again with two clusters and see what the result looks like.

km <- kmeans(new_boston, centers = 2)
km$cluster <- as.character(km$cluster) #ggplot needs a discrete variable for colouring

#Let's split the variables into two smaller sets so we can actually see something.
ggpairs(new_boston, columns = 1:7, mapping = aes(alpha = 0.5, col = km$cluster))
## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

## Warning in cor(x, y): the standard deviation is zero

ggpairs(new_boston, columns = 8:14, mapping = aes(alpha = 0.5, col = km$cluster))

That's a lot of little dots. In general, the two clusters can be distinguished in quite a few of the scatter plots and distribution graphs. In the first plot, the proportion of non-retail business acres (indus) and the concentration of nitrogen oxides (nox) are clearly divided between the two clusters; in the second plot, the property tax rate (tax) is also clearly divided. These variables seem to play an important role in the clustering. On the other hand, the number of rooms per dwelling (rm), for example, doesn't seem to matter much in this cluster analysis.